Los Angeles County provides openly available data on all restaurant and market inspections over the past 5 years. Facilities are subject to inspection 1 to 3 times a year, and made public within 1 week of inspection date. The frequency in which restaurants and food markets are inspected depends on the public health risk associated with the food products served or prepared and on the facility’s history of inspection grades. Inspectors deduct points based on violations and health risks, which is turned into a score out of 100. In addition, Los Angeles County data from 2018 on population health is publicly available. Demographic data, such as age and race distribution, socioeconomic data, such as proportion receiving EBT and proportion employed, and health outcomes data, such as proportion with asthma and rates of suicide, are provided for 87 cities within Los Angeles County.
I am interested in exploring restaurant inspection ratings in LA County. I have a few questions, with the main one being: Are restaurant inspection ratings associated with community health status? Secondary questions include: What are the “safest” and most “dangerous” cities in LA County for eating restaurant food? What restaurant chain is the “safest” to eat at? What restaurant chains should one proceed with caution?
I used 2 data sets which I merged together for this project. Both are available at data.lacounty.gov to download as CSV (I have also uploaded these datasets to my github repository). The first is a dataset of all LA County restaurant inspections. I added LA County city health data to see if there were any relationships between public health outcomes and local restaurant hygiene. These datasets were merged by city name, and no restaurant was missing its city in the first dataset. City names were briefly inspected to ensure matching would be feasible. As Los Angeles had many sub-cities for which there was health data, I only included the “City of Los Angeles” data to represent the local public health for any restaurant with city listed as Los Angeles. Only restaurants in cities with health data were included in this analysis.
if (!file.exists("LACinspections.csv.zip"))
download(
url = "https://raw.githubusercontent.com/v-yin/PM566-Midterm/LACinspections.csv.zip",
dest = "LACinspections.csv.zip",
mode="wb"
)
unzip("LACinspections.csv.zip", exdir="./")
inspect <- read.csv("LACinspections.csv")
if (!file.exists("LAChealth.csv"))
download(
url = "https://raw.githubusercontent.com/v-yin/PM566-Midterm/main/LAChealth.csv",
dest = "LAChealth.csv",
mode="wb"
)
health <- read.csv("LAChealth.csv")
health$GEONAME <- toupper(health$GEONAME)
health$GEONAME <- replace(health$GEONAME, health$GEONAME=="LOS ANGELES, CITY OF", "LOS ANGELES")
resthealth <- merge(
x = inspect,
y = health,
by.x = "FACILITY_CITY",
by.y = "GEONAME",
all.x = FALSE,
all.y = FALSE
)
resthealth <- data.table(resthealth)
Data were explored utilizing R package ggplot2 to create a histogram of scores. Implausibly low scores were deleted (score less than 50, of which there was 1 value with score 3). Average scores within cities were computed and compared. Restaurant chains were identified through tokenizing words as bigrams, and looking to see most common chain restaurants. The following set of 9 chains were selected: McDonald’s, Jack in the Box, Starbucks, El Pollo Loco, Panda Express, Taco Bell, Del Taco, In N Out, Panera Bread. Average chain inspection scores were computed and compared. Measures of public health were selected: proportion with depression, proportion with obesity, proportion with diabetes. To explore a chain restaurant inspections scores with its surrounding city’s health, average score for a chain restaurant was calculated.
In the future, I hope to use OpenStreetMaps API to gather latitude and longitude data for each restaurant. For now, I was able to use geocode_zip to gather lattitude and longitude for most zip codes in the dataset. The zip codes’ average inspection score, proportion of diabetes, obesity, and depression were calculated. Heat maps of the averages were created to compare spatially where clean/dirty restaurants are in relation to proportion of the population with Diabetes, obesity, or depression.
# Delete scores less than 50
library(data.table)
resthealth <- resthealth[SCORE>50]
# Find average inspection score by city
city_avg <- resthealth[ , .(
scoreavg = mean(SCORE)
), by = "FACILITY_CITY"]
# Find restaurant chains
resthealth %>% unnest_ngrams(word, FACILITY_NAME, n=2) %>% anti_join(stop_words, by = c("word")) %>% count(word, sort=TRUE) %>% as_tibble() %>% print(n=100)
## # A tibble: 40,730 × 2
## word n
## <chr> <int>
## 1 <NA> 16432
## 2 ralphs market 2066
## 3 7 eleven 2009
## 4 meat market 1907
## 5 mexican grill 1488
## 6 in the 1444
## 7 the box 1441
## 8 el pollo 1416
## 9 jack in 1408
## 10 starbucks coffee 1343
## 11 pollo loco 1297
## 12 mini market 1250
## 13 panda express 1176
## 14 vons market 1151
## 15 mexican food 1124
## 16 taco bell 969
## 17 4 less 948
## 18 food 4 924
## 19 eleven store 855
## 20 los angeles 826
## 21 chipotle mexican 804
## 22 whole foods 800
## 23 pizza hut 793
## 24 el super 774
## 25 domino's pizza 760
## 26 cvs pharmacy 759
## 27 little caesars 737
## 28 carl's jr 729
## 29 foods market 728
## 30 fried chicken 717
## 31 burger king 707
## 32 fish market 702
## 33 ice cream 684
## 34 superior grocers 663
## 35 dodger stadium 656
## 36 waba grill 656
## 37 stater bros 651
## 38 smart final 616
## 39 ranch market 588
## 40 mexican restaurant 585
## 41 farmers market 554
## 42 wal mart 549
## 43 hawaiian bbq 543
## 44 afc sushi 519
## 45 bakery cafe 516
## 46 del taco 516
## 47 la michoacana 514
## 48 subway sandwiches 514
## 49 sprouts farmers 497
## 50 liquor market 494
## 51 tacos el 491
## 52 northgate market 483
## 53 costco wholesale 481
## 54 thai restaurant 478
## 55 bar grill 469
## 56 thai cuisine 469
## 57 hong kong 465
## 58 jamba juice 444
## 59 jersey mike's 444
## 60 house of 441
## 61 rite aid 440
## 62 chinese food 438
## 63 baskin robbins 435
## 64 the coffee 433
## 65 dollar tree 426
## 66 mike's subs 422
## 67 99 cents 421
## 68 numero uno 415
## 69 coffee bean 409
## 70 center levy 403
## 71 staples center 403
## 72 tea leaf 403
## 73 flame broiler 402
## 74 vallarta supermarket 401
## 75 bean tea 400
## 76 coffee shop 396
## 77 juice bar 381
## 78 beverly hills 375
## 79 mariscos el 368
## 80 the flame 365
## 81 japanese restaurant 360
## 82 in n 356
## 83 n out 356
## 84 korean bbq 354
## 85 fast food 351
## 86 out burger 350
## 87 drive in 347
## 88 and grill 341
## 89 burger grill 334
## 90 country club 329
## 91 mini mart 323
## 92 yum yum 318
## 93 seafood restaurant 311
## 94 am pm 310
## 95 the habit 308
## 96 noodle house 307
## 97 santa monica 307
## 98 bristol farms 304
## 99 gelson's market 303
## 100 sam's club 302
## # … with 40,630 more rows
# Create chain variable
resthealth$FACILITY_NAME <- toupper(resthealth$FACILITY_NAME)
resthealth$CHAIN <- ifelse(grepl("MCDONALD", resthealth$FACILITY_NAME), "McDonald's", ifelse(grepl("JACK IN THE BOX", resthealth$FACILITY_NAME), "Jack in the Box", ifelse(grepl("STARBUCKS", resthealth$FACILITY_NAME), "Starbucks", ifelse(grepl("EL POLLO LOCO", resthealth$FACILITY_NAME), "El Pollo Loco", ifelse(grepl("PANDA EXPRESS", resthealth$FACILITY_NAME), "Panda Express", ifelse(grepl("TACO BELL", resthealth$FACILITY_NAME), "Taco Bell", ifelse(grepl("DEL TACO", resthealth$FACILITY_NAME), "Del Taco", ifelse(grepl("OUT BURGER", resthealth$FACILITY_NAME), "In N Out", ifelse(grepl("PANERA BREAD", resthealth$FACILITY_NAME), "Panera Bread", NA)))))))))
# Find average inspection score by city
chain_avg <- resthealth[ , .(
scoreavg = mean(SCORE)
), by = "CHAIN"]
# Clean health outcome data to be numeric
resthealth$Prop_obse <- as.numeric(as.character(resthealth$Prop_obse))
## Warning: NAs introduced by coercion
resthealth$Prop_DM <- as.numeric(as.character(resthealth$Prop_DM))
## Warning: NAs introduced by coercion
# Average score by restaurant
zip_avg <- resthealth[ , (ZipAvg = mean(SCORE, by = "FACILITY_ZIP")), list(FACILITY_NAME, FACILITY_ADDRESS, FACILITY_CITY, FACILITY_STATE, FACILITY_ZIP, Prop_DM, Prop_obse, Prop_depr) ]
zip_avg$FACILITY_ZIP <- as.character(zip_avg$FACILITY_ZIP)
# geocode to get lattitude and longitude by zip code
library(zipcodeR)
zip_cord <- geocode_zip(zip_avg$FACILITY_ZIP)
zip <- merge(
x = zip_avg,
y = zip_cord,
by.x = "FACILITY_ZIP",
by.y = "zipcode",
all.x = TRUE,
all.y= FALSE
)
# attempt to get restaurant lat/lng
# failed: geocode_osm , can't install SUNGEO
# failed: tidygeocoder, takes hours
# myaddress.df <- resthealth[, query := paste0(FACILITY_ADDRESS, ", ", FACILITY_CITY, ", CA ", FACILITY_ZIP)]
# address <- geo(address = myaddress.df$query, method = "osm")
In total, there were 252773 inspections of 39423 restaurants in 66 cities within LA County.
Of the 252773 inspections included in the analysis, the average grade was 93.9351117 with a standard deviation of 3.8669866. The highest score was a perfect score, 100 whereas the lowest score was 54. Interestingly, there appears to be a peak at 90, which corresponds to the lowest possible score to achieve an A rating. This may hint at bias involved in the inspection grading process.
library(ggplot2)
scorehisto <- ggplot(resthealth, aes(x=SCORE)) +
geom_histogram(binwidth=1, fill="red") +
ggtitle("Distribution of LA County Restaurant Inspection Scores, 2017-2022") +
xlab("Inspection Score")
scorehisto
Of the 66 cities included in this analysis, Long Beach had the highest average city inspection score. The city with the worst average score was Monterey Park.
city_avg %>% arrange(desc(scoreavg)) %>% slice(1:5) %>% knitr::kable(col.names = c("City", "Average Score"), caption = "Top 5 Cities")
| City | Average Score |
|---|---|
| LONG BEACH | 97.37500 |
| CALABASAS | 96.03195 |
| SANTA CLARITA | 95.74072 |
| MAYWOOD | 95.60426 |
| EAST LOS ANGELES | 95.49254 |
city_avg %>% arrange(scoreavg) %>% slice(1:5) %>% knitr::kable(col.names = c("City", "Average Score"), caption = "Bottom 5 Cities")
| City | Average Score |
|---|---|
| MONTEREY PARK | 91.63380 |
| ROWLAND HEIGHTS | 91.68030 |
| ALHAMBRA | 92.36034 |
| GARDENA | 92.36540 |
| CERRITOS | 92.59487 |
Of the chain restaurants examined, Panda Express had the best average score, whereas Del Taco had the lowest average score. Cities with Starbucks and Panera appeared to have the lowest proportion of diabetes. However, depression was highest in cities with Starbucks and Panda Express. Cities with Del Taco and Panera Bread had the highest amount of obesity.
chain_avg %>% na.omit() %>% arrange(desc(scoreavg)) %>% slice(1:10) %>% knitr::kable(col.names = c("Chain", "Average Score"), caption = "Chain Ranking")
| Chain | Average Score |
|---|---|
| Panda Express | 97.01446 |
| In N Out | 96.89239 |
| Starbucks | 96.62458 |
| Taco Bell | 96.52941 |
| McDonald’s | 95.15971 |
| El Pollo Loco | 94.92152 |
| Jack in the Box | 94.79830 |
| Panera Bread | 94.62241 |
| Del Taco | 93.16085 |
resthealth[ , .(
avgDM = mean(Prop_DM, na.rm = T),
avgMDD = mean(Prop_depr, na.rm = T),
avgOB = mean(Prop_obse, na.rm = T)
), by = "CHAIN"] %>% na.omit() %>% as_tibble() %>% knitr::kable(col.names= c("Chain", "Average Proportion Diabetes", "Average Proportion Depression", "Average Proportion Obesity"), caption = "Proportion of Diabetes, Depression, and Obesity in surrounding city by chain restaurant")
| Chain | Average Proportion Diabetes | Average Proportion Depression | Average Proportion Obesity |
|---|---|---|---|
| Starbucks | 0.0984833 | 0.0875348 | 0.2299143 |
| Taco Bell | 0.1034938 | 0.0831847 | 0.2472439 |
| McDonald’s | 0.1036016 | 0.0826779 | 0.2438592 |
| In N Out | 0.1005722 | 0.0750168 | 0.2412087 |
| Panda Express | 0.1010105 | 0.0841394 | 0.2398609 |
| Jack in the Box | 0.1053523 | 0.0794227 | 0.2500741 |
| Panera Bread | 0.0976643 | 0.0812643 | 0.2606151 |
| El Pollo Loco | 0.1034037 | 0.0799796 | 0.2445605 |
| Del Taco | 0.1051764 | 0.0806202 | 0.2659273 |
While most zip codes appear to have a decent average health inspection score (90+), there appear to be some areas where scores are lower. These include El Monte/Covina, Compton, and Central LA (West Adams). There even appears to be one zip code in Santa Monica with a poor average inspection score. In the following maps, I will explore spatial patterns in proportion of diabetes, obesity, and depression to see if it mirrors patterns found in the food inspection scores by zip code.
score.pal <- colorNumeric("inferno", domain=zip$V1)
scoremap <- leaflet(zip) %>%
addProviderTiles("CartoDB.Positron") %>%
addCircles(
lat = ~lat, lng = ~lng,
label = ~paste0(V1), color = ~score.pal(V1),
opacity = 1, fillOpacity = 1, radius = 200
) %>%
addLegend("bottomright", pal=score.pal, values = zip$V1, title= "Inspection Score", opacity=1) %>% addTiles() %>%
addControl("Average Inspection Scores by Zip Code", position = "bottomleft")
## Warning in validateCoords(lng, lat, funcName): Data contains 124 rows with
## either missing or invalid lat/lon values and will be ignored
scoremap
The distribution of the population proportion of diabetes in LA County has a clearer spatial pattern than those of restaurant inspection scores. It appears that diabetes is lowest in communities bordering the ocean and Glendale. Diabetes is most prevalent in South/Southwest/Southeast Los Angeles. These are also some of the communities where restaurant scores are lower.
DM.pal <- colorNumeric("viridis", domain=zip$Prop_DM)
DMmap <- leaflet(zip) %>%
addProviderTiles("CartoDB.Positron") %>%
addCircles(
lat = ~lat, lng = ~lng,
label = ~paste0(Prop_DM), color = ~DM.pal(Prop_DM),
opacity = 1, fillOpacity = 1, radius = 200
) %>%
addLegend("bottomright", pal=DM.pal, values = zip$Prop_DM, title= "Proportion in City with Diabetes", opacity=1) %>% addTiles()
## Warning in validateCoords(lng, lat, funcName): Data contains 124 rows with
## either missing or invalid lat/lon values and will be ignored
DMmap
Obesity proportion by city follows patterns similar to that of Diabetes. It appears that the city of Inglewood has one of the highest proportion of obesity, and it also does have poorer food inspection grades. South LA appears to be one of the areas with worst restaurant inspection scores and highest in comorbidities.
ob.pal <- colorNumeric("plasma", domain=zip$Prop_obse)
obmap <- leaflet(zip) %>%
addProviderTiles("CartoDB.Positron") %>%
addCircles(
lat = ~lat, lng = ~lng,
label = ~paste0(Prop_obse), color = ~ob.pal(Prop_obse),
opacity = 1, fillOpacity = 1, radius = 200
) %>%
addLegend("bottomright", pal=ob.pal, values = zip$Prop_obse, title= "Proportion in City with Obesity", opacity=1) %>% addTiles()
## Warning in validateCoords(lng, lat, funcName): Data contains 124 rows with
## either missing or invalid lat/lon values and will be ignored
obmap
Interestingly, depression appears to take a different spatial pattern than diabetes and obesity. Here, depression is clustered around Santa Monica, with one zip code highly prevalent in depression near Inglewood. It does not seem that restaurant hygeine correlates strongly with depression proportions.
dep.pal <- colorNumeric("BuPu", domain=zip$Prop_depr)
depmap <- leaflet(zip) %>%
addProviderTiles("CartoDB.Positron") %>%
addCircles(
lat = ~lat, lng = ~lng,
label = ~paste0(Prop_depr), color = ~dep.pal(Prop_depr),
opacity = 1, fillOpacity = 1, radius = 200
) %>%
addLegend("bottomright", pal=dep.pal, values = zip$Prop_depr, title= "Proportion in City with Depression", opacity=1) %>% addTiles()
## Warning in validateCoords(lng, lat, funcName): Data contains 124 rows with
## either missing or invalid lat/lon values and will be ignored
depmap
There appears to be some correlation between poor health and poor restaurant inspection scores. The communities where this was apparent were clustered in South LA (ie Compton, Inglewood), and Eastern Los Angeles. Inglewood appears to be one of the hardest hit cities with regards to poor inspection scores, high proportion of diabetes and obesity, and high proportion of depression. The safest chain restaurant appears to be Panda Express, whereas one should be cautious of Del Taco. San Gabriel Valley cities, like Monterey Park, Alhambra, and Rowland Heights, were found to have the lowest inspection scores in LA County.